Math Review

Chelsea Parlett-Pelleriti

Calculus + Linear Algebra

Derivatives

Derivative: Instantaneous rate of change

Note: We most often care about when derivatives are 0. Why?
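A quick worked example of why: at a smooth minimum or maximum the function is momentarily flat, so the derivative is zero there.

\[ f(x) = (x-2)^2, \quad f'(x) = 2(x-2) = 0 \;\Rightarrow\; x = 2 \]

Setting \(f'(x) = 0\) finds the candidate minima and maxima (here, the minimum at \(x = 2\)).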

Integration

Integrals: Area under the curve \(\int_{-\infty}^{\infty} f(x)\, dx\)

Note: If a curve is a probability distribution (proper) what is the AUC?
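A sketch in R of the answer to that note, using the standard normal as the example density: a proper probability distribution has total area 1 under its curve.

```r
# area under a proper probability density: should be 1
# (standard normal, integrated over the whole real line)
auc <- integrate(dnorm, lower = -Inf, upper = Inf)
print(round(auc$value, 3))  # [1] 1
```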

Gradients

Gradients

\[f = x^2 + xy + y^2\]

How does \(f(x,y)\) change when \(x\) changes? when \(y\) changes?

\[\frac{\partial f}{\partial x} = 2x + y\]

\[\frac{\partial f}{\partial y} = x + 2y\]

Gradients

Shove those in a vector, and you’ve got a gradient.

\[ \begin{bmatrix} \frac{\partial f}{\partial x} \\ \frac{\partial f}{\partial y} \end{bmatrix} = \begin{bmatrix} 2x + y \\ x + 2y \end{bmatrix} \]

Note: what do gradients tell us about a function?
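A quick numerical check in R (a sketch using central finite differences, no packages): the gradient of \(f = x^2 + xy + y^2\) at \((1, 2)\) should be \((2x + y,\; x + 2y) = (4, 5)\).

```r
f <- function(x, y) x^2 + x*y + y^2

# central-difference approximation of the gradient at (1, 2)
h <- 1e-6
grad_numeric <- c((f(1 + h, 2) - f(1 - h, 2)) / (2 * h),
                  (f(1, 2 + h) - f(1, 2 - h)) / (2 * h))
print(round(grad_numeric, 3))  # matches the analytic gradient c(4, 5)
```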

Reading a Contour Plot

Important Matrices

  • covariance
  • correlation
  • hessian
  • design matrix

Covariance and Correlation Matrices

Covariance: Do two variables move together? Correlation rescales covariance by the standard deviations: \[Corr(x,y) = \frac{Cov(x,y)}{sd(x)sd(y)}\]
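A small sketch in R on made-up data, verifying the formula above: correlation is just covariance rescaled by the standard deviations.

```r
set.seed(1)
x <- rnorm(1000)
y <- x + rnorm(1000)  # correlated with x by construction

# Corr(x, y) = Cov(x, y) / (sd(x) * sd(y))
manual <- cov(x, y) / (sd(x) * sd(y))
print(all.equal(manual, cor(x, y)))  # [1] TRUE
```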

Hessian

gradients: first derivatives of a multivariable function \(f(x,y)\)

hessians: second derivatives of a multivariable function \(f(x,y)\)

\[\begin{bmatrix} \frac{\partial^2 f}{\partial x^2} & \frac{\partial^2 f}{\partial x \partial y} \\ \frac{\partial^2 f}{\partial y \partial x} & \frac{\partial^2 f}{\partial y^2} \\ \end{bmatrix}\]

Note: What does a second derivative tell you in a single variable function \(f(x)\)?

Hessian

🚗

Position: \(r(t)\)

Velocity: \(\frac{dr}{dt}\)

Acceleration: \(\frac{d^2r}{dt^2}\)

Hessian

\[f = x^2 + xy + y^2\]

\[\frac{\partial f}{\partial x} = 2x + y; \frac{\partial f}{\partial y} = x + 2y\] \[\begin{bmatrix} \frac{\partial^2 f}{\partial x^2} & \frac{\partial^2 f}{\partial x \partial y} \\ \frac{\partial^2 f}{\partial y \partial x} & \frac{\partial^2 f}{\partial y^2} \\ \end{bmatrix} = \begin{bmatrix} 2 & 1 \\ 1 & 2 \\ \end{bmatrix}\]
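A numerical sketch in R (central differences, no packages): the Hessian of \(f = x^2 + xy + y^2\) is constant, so at any point it should come out as the matrix above.

```r
f <- function(v) v[1]^2 + v[1]*v[2] + v[2]^2

# central-difference second partials at an arbitrary point
h <- 1e-4
p <- c(1, 1)
H <- matrix(0, 2, 2)
for (i in 1:2) for (j in 1:2) {
  ei <- ej <- c(0, 0)
  ei[i] <- h
  ej[j] <- h
  H[i, j] <- (f(p + ei + ej) - f(p + ei - ej) -
              f(p - ei + ej) + f(p - ei - ej)) / (4 * h^2)
}
print(round(H))  # rows: 2 1 / 1 2, the analytic Hessian
```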

Hessian

Where is \(f''(x) = 0\)? positive? negative?

Hessian

Design Matrix

a design matrix in a linear model is a matrix of all the explanatory variables, \(X\).

\[y = X\beta\]

  • simple case: columns of \(X\) are each vectors of predictor variables (e.g. height, age…)

  • complicated case: one-hot encoding, intercept, polynomial terms
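A sketch of the "complicated case" in R with made-up data: `model.matrix()` builds the design matrix, adding the intercept column and one-hot (dummy) encoding the factor.

```r
# hypothetical data: two numeric predictors and one categorical
df <- data.frame(age    = c(24, 21, 22),
                 height = c(170, 162, 150),
                 group  = factor(c("a", "b", "b")))

# intercept column + dummy coding of `group` (reference level dropped)
X <- model.matrix(~ age + height + group, data = df)
print(colnames(X))  # "(Intercept)" "age" "height" "groupb"
```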

Design Matrix

\[\underset{n\times (p+1)}{X} = \begin{bmatrix} 1 & 24 & 170 \\ 1 & 21 & 162 \\ ... & ... & ... \\ 1 & 22 & 150 \\ \end{bmatrix}; \underset{(p+1)\times 1}{\beta} = \begin{bmatrix}3.2 \\ 0.1 \\ 0.02 \end{bmatrix}\]

Design Matrix

Using a design matrix \(X\) makes writing out the math easier. \(y\) is (usually) an \(n\times 1\) vector of responses, \(X\) is an \(n \times (p+1)\) matrix of predictors…plus a little extra.

\[ y = X \beta \] It’s just matrix shorthand for the traditional linear regression formula.

\[ y = \beta_0 + \beta_1x_1 + ... + \beta_px_p \]

Design Matrix

\[ y = \beta_0 + \beta_1x_1 + ... + \beta_px_p \]

🔮 notice in the formula: does changing any of the \(x_j\)s impact the other \(x\)s?

Design Matrix

\[ y = X \beta \]

The design matrix also allows us to write this problem in a format that makes it clear that least squares can help us choose the \(\beta\) that minimizes \(\parallel y - X\beta \parallel^2\)!

Least Squares

\[ y = X \beta \]

Often, there is no \(\beta\) that can perfectly predict \(y\) (and if there is…maybe it’s overfitting…)

Least Squares

We want to choose a vector in the column space of \(X\) that is as close to \(y\) as possible. In other words: minimize \(\parallel y- X\beta \parallel^2\).

Least Squares

The best \(\beta\) we can choose that is both:

  • in the column space of \(X\), and
  • minimizes the distance between \(y\) and \(X\beta\) (the error)

is the projection (“shadow”) of \(y\) on \(C(X)\).

Least Squares

So our best fit is the projection of \(y\) onto the column space \(C(X)\):

\[ X \beta = proj_{C(X)} y \]

Least Squares

A little mathy-math

\[ X\beta - y = proj_{C(X)} y - y \]

Least Squares

\(proj_{C(X)} y - y\) is orthogonal to \(C(X)\), and thus is in the nullspace of \(X^T\).

what does that mean?

\[ X^T(X\beta - y) = 0 \]

\[ X\beta - y \in C(X)^\perp \rightarrow \\ C(X)^\perp = N(X^T) \rightarrow \\ X^T(X\beta - y) = 0 \]

Least Squares

doing a little LA: \[ X^TX\beta - X^Ty = 0 \\ X^TX\beta = X^Ty \\ (X^TX)^{-1}X^TX\beta = (X^TX)^{-1}X^Ty \\ \beta = (X^TX)^{-1}X^Ty \]
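A sketch in R on simulated data: solving the normal equations directly gives the same \(\beta\) as R's built-in `lm()`.

```r
set.seed(2)
n <- 100
X <- cbind(1, rnorm(n), rnorm(n))        # design matrix with intercept
y <- X %*% c(3.2, 0.1, 0.02) + rnorm(n)  # known beta plus noise

# normal equations: beta = (X^T X)^{-1} X^T y
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y

# lm() agrees (fit without its own intercept; X already has one)
print(max(abs(beta_hat - coef(lm(y ~ X - 1)))) < 1e-8)  # [1] TRUE
```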

Least Squares

  • visual intuition for minimizing errors
  • why multicollinearity is a problem

(for example, if there is perfect collinearity: total sales, US sales, non-US sales, \(X^TX\) is not invertible, \(X\) is not of full rank, and there exist multiple solutions!)

Determinants

the determinant of \(A\) is the (signed) area of the unit square after it is transformed by \(A\)

\[ \begin{bmatrix} 1 & 0.5 \\ 0.5 & 1 \end{bmatrix} \]

Determinants

the determinant of \(A\) is the (signed) area of the unit square after it is transformed by \(A\)

\[ \begin{bmatrix} 1 & 2 \\ 0.5 & 1 \end{bmatrix} \]

Trace

trace: sum of the diagonal elements

\[ \begin{bmatrix} 1 & 2 \\ 0.5 & 1 \end{bmatrix} \]

it is also the sum of the eigenvalues: \(tr \left ( \underset{n \times n}{A} \right ) = \sum_{i=1}^n \lambda_i\)

Important Matrix Decompositions

  • Singular Value
  • Eigen
  • QR
  • Cholesky

Singular Value Decomposition

\(A = UDV^T\), where:

  • \(U\) is an orthogonal matrix; eigenvectors of \(AA^T\)
  • \(D\) is a diagonal matrix; its diagonal holds the square roots of the positive eigenvalues of \(AA^T\) (or \(A^TA\))
  • \(V\) is an orthogonal matrix; eigenvectors of \(A^TA\)

Singular Value Decomposition

We can use SVD to do PCA by decomposing the data matrix \(X\) rather than eigendecomposing the covariance matrix \(C = \frac{X^TX}{n-1}\) into \(VLV^T\) where \(V\) are the eigenvectors (PCs) and \(L\) contains the eigenvalues.

if \(X = UDV^T\), \(V\) contains the eigenvectors needed to create PCs, and \(D\) contains the (scaled, root) eigenvalues for each PC.

\[ C = \frac{VDU^TUDV^T}{n-1} = \\ V\frac{D^2}{n-1}V^T \]
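A sketch in R on simulated data: the eigenvalues of the covariance matrix match \(\frac{D^2}{n-1}\) from the SVD of the centered data matrix.

```r
set.seed(3)
X <- scale(matrix(rnorm(200 * 3), ncol = 3),
           center = TRUE, scale = FALSE)  # centered data matrix
n <- nrow(X)

# eigendecomposition of the covariance matrix...
C <- t(X) %*% X / (n - 1)
eig_vals <- eigen(C)$values

# ...vs. singular values of X: lambda = d^2 / (n - 1)
s <- svd(X)
print(max(abs(eig_vals - s$d^2 / (n - 1))) < 1e-10)  # [1] TRUE
```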

Singular Value Decomposition

source: Wikipedia

Eigendecomposition

\(A = P\Lambda P^{-1}\), where:

  • \(P\) is a matrix of eigenvectors
  • \(\Lambda\) is a diagonal matrix of eigenvalues

Eigendecomposition

\(Ax = \lambda x\): multiplying an eigenvector \(x\) by \(A\) only scales it, by its eigenvalue \(\lambda\).

Eigendecomposition

\[ B = \begin{bmatrix} 1 & 2 \\ 0.5 & 1 \end{bmatrix}, A = \begin{bmatrix} 1 & 0.5 \\ 0.5 & 1 \end{bmatrix} \]

Eigendecomposition

  • Principal Component Analysis
  • Steady State of Transition Matrix (when \(\lambda = 1\))

\[ Ax = \lambda x \]

m <- matrix(c(1.0, 0.0,
              0.4, 0.6), nrow = 2)
m
     [,1] [,2]
[1,]    1  0.4
[2,]    0  0.6
eigen(m)$values
[1] 1.0 0.6
# steady state: normalize the lambda = 1 eigenvector
(eigen(m)$vectors[,1] )/sum(eigen(m)$vectors[,1])
[1] 1 0

QR Decomposition

For \(A = QR\):

  • \(Q\): An orthogonal matrix (\(Q^T = Q^{-1}\))
  • \(R\): An upper triangular matrix

QR Decomposition

\[ \begin{bmatrix} 1 & 3\\ 1 & -1 \end{bmatrix} = \begin{bmatrix} \frac{1}{\sqrt2} & \frac{1}{\sqrt2}\\ \frac{1}{\sqrt2} & -\frac{1}{\sqrt2} \end{bmatrix} \begin{bmatrix} \sqrt2 & \sqrt2\\ 0 & 2\sqrt2 \end{bmatrix} \]
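Checking the decomposition above in R (note: `qr()` may flip the signs of \(Q\) and \(R\) relative to the hand computation, but their product is still \(A\)):

```r
A <- matrix(c(1, 1, 3, -1), nrow = 2)  # the matrix from the slide
decomp <- qr(A)
Q <- qr.Q(decomp)
R <- qr.R(decomp)

# Q is orthogonal and QR reconstructs A
print(max(abs(Q %*% R - A)) < 1e-10)           # [1] TRUE
print(max(abs(t(Q) %*% Q - diag(2))) < 1e-10)  # [1] TRUE
```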

QR Decomposition

Solve \(X\beta = y\) in the least-squares sense: \(\hat\beta = \underset{\beta}{\arg\min}\left\| X\beta - y \right\|_2\)

Could do

\[ \left( X\beta - y\right)^T\left( X\beta - y\right) \\ \beta^TX^TX\beta - \beta^TX^Ty - y^TX\beta + y^Ty \\ \beta^TX^TX\beta- 2y^TX\beta + y^T y \\ \nabla_\beta [\beta^TX^TX\beta- 2 y^TX\beta + y^Ty] = 2X^TX\beta - 2X^T y \\ 2X^TX\beta - 2X^T y = 0 \\ X^TX\beta = X^Ty \\ \beta = \underbrace{\left(X^TX\right)^{-1}X^T}_\text{Moore-Penrose Inverse} y \]

But finding inverses SUCKS.

QR Decomposition

Substitute \(A = QR\).

\[ A^TAx = A^Tb \to \left(QR \right)^T\left(QR \right)x = \left(QR \right)^Tb \\ R^TQ^TQRx = R^TQ^Tb \\ R^TRx = R^TQ^Tb \\ Rx = Q^Tb \\ Rx = v \]

Since \(R\) is upper triangular, it’s easy to solve with backsubstitution. No inverses!!

(Sidetrack) Back Substitution

\[ x_1 + 2x_2 -x_3 = 3 \\ -2x_2 - 2x_3 = -10 \\ -6x_3 = -18 \] \[ A = \begin{bmatrix} 1 & 2 & -1 \\ 0 & -2 & -2 \\ 0 & 0 & -6 \end{bmatrix}, A\begin{bmatrix}x_1 \\ x_2 \\ x_3 \end{bmatrix} = \begin{bmatrix}3 \\ -10 \\ -18 \end{bmatrix} \]
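R's `backsolve()` does exactly this back substitution; a sketch on the system above:

```r
# the upper triangular system from the slide (matrix() fills by column)
A <- matrix(c(1,  0,  0,
              2, -2,  0,
             -1, -2, -6), nrow = 3)
b <- c(3, -10, -18)

# solve from the bottom row up: x3 first, then x2, then x1
x <- backsolve(A, b)
print(x)  # [1] 2 2 3
```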

Cholesky Decomposition

For \(A\), a symmetric positive definite matrix, \(A = LL^T\), where

  • \(L\) is a lower triangular matrix
  • \(L^T\) is an upper triangular matrix, the transpose of \(L\)

Basically the Square Root of a Matrix (and a special case of \(LU\) decomp)

Cholesky Decomposition

Cholesky Decomp can force vectors of random numbers to have a specified covariance!

x <- rnorm(100000)
y <- rnorm(100000)
data <- rbind(x,y)
print(round(cor(x,y),3))
[1] 0
desired_corr <- matrix(c(1,0.5,
                         0.5,1), nrow = 2)
# returns Upper Triangular
L <- chol(desired_corr)
# multiply data by t(L)
Z <- t(L) %*% data
# correlated!
print(round(cor(Z[1,], Z[2,]),3))
[1] 0.502

Cholesky Decomposition

Proof

Remember, \(\Sigma = LL^T\) and \(Z = LX\)

\[ \mathbb{E}\left( ZZ^T\right) = \mathbb{E}\left( (LX)(LX)^T\right) = \\ \mathbb{E}\left( LXX^TL^T\right) = \\ L\mathbb{E}\left(XX^T\right)L^T = \\ LIL^T = LL^T = \Sigma \]

Taylor Series Expansion


How do we find this approximation?

\[ \sum_{n=0}^{\infty} \frac{f^{(n)}(a)}{n!} (x-a)^n \]

Taylor Series Expansion

How do we find this approximation? (we’re matching derivatives!*)

\[ \sum_{n=0}^{\infty} \frac{f^{(n)}(0)}{n!} (x)^n = \frac{f(0)}{0!} + \frac{f'(0)}{1!} x + \frac{f''(0)}{2!} x^2 + \frac{f'''(0)}{3!} x^3 + ... \]

  • matching important characteristics is a common theme: e.g. MOM.

Taylor Series Expansion

Use?

\[ \sum_{n=0}^{\infty} \frac{f^{(n)}(0)}{n!} (x)^n = \frac{f(0)}{0!} + \frac{f'(0)}{1!} x + \frac{f''(0)}{2!} x^2 \]
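A sketch in R of that truncated (second-order) expansion, using \(\cos(x)\) around 0 as the example function: \(\cos(0) + (-\sin(0))x + \frac{-\cos(0)}{2}x^2 = 1 - \frac{x^2}{2}\).

```r
# second-order Taylor approximation of cos(x) around 0
taylor2 <- function(x) 1 - x^2 / 2

x <- 0.1
print(round(c(exact = cos(x), approx = taylor2(x)), 5))
```

Near the expansion point the quadratic tracks the function closely; the error grows as \(x\) moves away from 0.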